rfcs: graph: support int4/int8 compression for K/V in fused SDPA #2041
Conversation
> inputs should be `1d` tensors.
> 2. For `per_group` quantization, all dimensions should match the input,
> except for the dimension where grouped quantization applies, which should
> be `src_dim / group_size`.
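(For concreteness, under this rule a `[K, N]` input grouped along `N` with group size `G` would take scale/zero-point inputs of shape `[K, N/G]`.)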
Not sure how that matches the case where the batch dimension should be broadcast for scales/zero-points:

W = [B, K, N]
pre-scale W: [1, gK, N] x [B, K, N] = W'
matmul: [B, M, K] x W' = [B, M, N]

The use cases I've seen are like that: the batch dimension doesn't receive its own dimension of scales.
Thanks for providing the case. Given the potential request from IPEX and per-token quantization, we decided to update the scale/zp input shape requirement for per-group quantization for scalability and flexibility. One potential option is to allow `1` on dimensions other than the last two, `K` and `N`.
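To make that relaxed rule concrete, here is a minimal standalone C++ sketch (an editor's illustration, not oneDNN API; the function name and the shapes in `main` are made up) of a shape check that accepts a broadcast batch dimension for per-group scales/zero-points:

```cpp
#include <cstdint>
#include <vector>

using dims = std::vector<int64_t>;

// Sketch: `scale` is a valid per-group scale shape for `src` if it matches
// the last two dims (with the grouped K axis reduced by `group_size`) and
// is either equal to `src` or 1 on every leading (e.g. batch) dimension.
bool per_group_scale_ok(const dims &src, const dims &scale, int64_t group_size) {
    if (src.size() != scale.size() || src.size() < 2) return false;
    const size_t nd = src.size();
    for (size_t i = 0; i + 2 < nd; ++i) // leading dims: exact match or 1
        if (scale[i] != src[i] && scale[i] != 1) return false;
    return scale[nd - 2] == src[nd - 2] / group_size // grouped K axis
            && scale[nd - 1] == src[nd - 1]; // N axis matches the input
}

int main() {
    // The reviewer's case: W = [B, K, N] with scales [1, K/group_size, N].
    return per_group_scale_ok({32, 4096, 4096}, {1, 32, 4096}, 128) ? 0 : 1;
}
```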
```c
/// 4-bit signed integer.
dnnl_s4 = 11,
/// 4-bit unsigned integer.
dnnl_u4 = 12,
```
For API completeness, I feel we do need to add both s4 and u4. Just out of curiosity, do you know whether both s4 and u4 are used for this int4 K/V compression request? If both are used, is there any difference in the quantization recipe between s4 and u4?
According to user request, they are likely to use `s4` for KV storage. But as the KV compression is still WIP, it might change in the future. Regarding the difference between `u4` and `s4` recipes: since int4 data types always use asymmetric quantization, the parameters will be quite similar for `u4` and `s4`. The difference should be mainly in the de-quantization logic.
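For reference, a minimal C++ sketch of where the two recipes diverge (an editor's illustration; the decoders and zero-point values are hypothetical, not oneDNN code). Both apply the same asymmetric formula `x_f = (x_q - zp) * scale`; only the nibble decoding differs:

```cpp
#include <cstdint>
#include <cstdio>

// Decode the low nibble as u4: range [0, 15].
int u4_to_int(uint8_t nibble) { return nibble & 0x0F; }

// Decode the low nibble as s4 via sign extension: range [-8, 7].
int s4_to_int(uint8_t nibble) {
    int v = nibble & 0x0F;
    return (v & 0x08) ? v - 16 : v;
}

// Asymmetric de-quantization, shared by both recipes.
float dequant(int q, int zp, float scale) { return (q - zp) * scale; }

int main() {
    uint8_t raw = 0x0E; // the same stored 4 bits
    // u4 decodes the bits as 14, s4 as -2; the zero point shifts the range.
    std::printf("u4: %.2f\n", dequant(u4_to_int(raw), /*zp=*/8, 0.1f));
    std::printf("s4: %.2f\n", dequant(s4_to_int(raw), /*zp=*/0, 0.1f));
    return 0;
}
```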
Thanks for updating the document. I left some minor comments.
> are required for each hidden dimension to maintain the model accuracy. The
> requirement for grouped zero points is not promised. See [int4
> The requirement for grouped zero points is not promised.

Could you please elaborate on the meaning of this phrase and how it affects the proposal?
> 1. Add `per_group` to the supported values of `qtype` attribute, and the default
> value will be unchanged.
> 2. The existing attribute `axis` will ignored if `per_group` quantization is
```suggestion
2. The existing attribute `axis` will be ignored if `per_group` quantization is
```
> each quantization group (which is `group_size`). And the other dimensions
> should be all `1`. The attribute is required when `per_group` quantization
> type is specified for `qtype`. If `per_group` quantization is not specified
> and `group_shape` attribute is given, it will be ignored.
It seems it's implicitly assumed that the number of dimensions in the `group_shape` attribute coincides with the number of dimensions of the input shape. It would be good to spell that out explicitly if that's the case.
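A minimal sketch of how that could be spelled out, in the style of the RFC's own example (assuming the proposed `group_shape` attribute carries one entry per input dimension; the `deq` op and attribute spellings follow the quoted snippets, not a released API):

```cpp
// Sketch: `group_shape` has the same number of dimensions as the input.
// For src_dims = {K, N} grouped along N with group size G:
deq.set_attr<std::string>(op::attr::qtype, "per_group");
deq.set_attr<std::vector<int64_t>>(
        op::attr::group_shape, {1, G}); // 1 = no grouping on the K axis
```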
```cpp
const size_t nG = 2, G = N / nG;

dims src_dims = {K, N};
dims scale_dims = {K, N/G};
```
```suggestion
dims scale_dims = {K, nG};
```
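(Since `G = N / nG`, the expression `N/G` evaluates back to `nG`, so the two spellings are numerically equal; the suggestion simply states the number of groups directly.)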
```cpp
// Specify the quantization type as per_group quantization.
deq.set_attr<std::string>(op::attr::qtype, "per_group");
// Axis indicates on which dimension the quantization will be applied.
deq.set_attr<int64_t>(op::attr::axis, 1);
```
Looks like this is not needed any longer.
> ```
> // ...
> ```
>
> ##### Limitation
```suggestion
#### Limitation
```
> Similar to the discussion of int8 and fp8 data types, an alternative solution
> may be supporting int4 data types directly in computation operations like MatMul
> and Convolution. But it will bloat and complicate the opset and op schema of
> oneDNN Graph. Hence it's not considered here.
I feel this paragraph is not quite relevant, as the path for integer data types was already paved and it won't change. I'd rather replace it with an explanation of why a vector was chosen over the single value + axis combination used initially. That is more valuable information for future reference.
> In addition, to enhance clarity and reduce compilation overhead, oneDNN Graph
> will support direct quantization between int4 data types and bf16/f16. This
> means that additional `Typecast` operations will no longer be necessary
> between quantization processes and subsequent operations in bf16/f16
> computation graphs. Although it will complicate the fusion patterns to some
> extent, we can address this by implementing more precise pattern definitions.
It seems to me that this paragraph belongs in a Proposal 3 section instead, since it already has `TypeCast` dropped from the picture.
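For reference, a sketch of the pattern change the quoted paragraph describes (dataflow only; op names follow the oneDNN Graph opset, and the int4 data-type support is as proposed in this RFC):

```cpp
// With an intermediate TypeCast (previous approach):
//   KV cache (s4/u4) --DynamicDequantize--> f32 --TypeCast--> f16 --> MatMul
//
// Direct int4 -> f16 de-quantization (this proposal):
//   KV cache (s4/u4) --DynamicDequantize--> f16 --> MatMul
```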
Description

This proposes supporting int4/int8 compression for K/V in fused SDPA.

Link to the rendered document.